Biostat 203B Homework 3

Due Feb 21 @ 11:59PM

Author

Palash Raval and 406551574

Display machine information for reproducibility:

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.1    fastmap_1.2.0     cli_3.6.3        
 [5] tools_4.3.1       htmltools_0.5.8.1 rstudioapi_0.17.1 yaml_2.3.10      
 [9] rmarkdown_2.28    knitr_1.48        jsonlite_1.8.9    xfun_0.48        
[13] digest_0.6.37     rlang_1.1.4       evaluate_1.0.1   

Load necessary libraries (you can add more as needed).

library(arrow)
Warning: package 'arrow' was built under R version 4.3.3

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(gtsummary)
Warning: package 'gtsummary' was built under R version 4.3.3
library(memuse)
Warning: package 'memuse' was built under R version 4.3.3
library(pryr)

Attaching package: 'pryr'
The following object is masked from 'package:gtsummary':

    where
library(R.utils)
Loading required package: R.oo
Warning: package 'R.oo' was built under R version 4.3.3
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.27.0 (2024-11-01 18:00:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where(), gtsummary::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(duckdb)
Warning: package 'duckdb' was built under R version 4.3.3
Loading required package: DBI
Warning: package 'DBI' was built under R version 4.3.3

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:    3.152 GiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

demographic = read_csv("~/mimic/hosp/patients.csv.gz", 
                       show_col_types = FALSE) %>%
  filter(subject_id == 10001217)
adt = read_csv("~/mimic/hosp/admissions.csv.gz",
               show_col_types = FALSE) %>%
  filter(subject_id == 10001217)
color_info = read_csv("~/mimic/hosp/transfers.csv.gz",
                      show_col_types = FALSE) %>%
  filter(subject_id == 10001217)
shape_info = read_csv("~/mimic/hosp/procedures_icd.csv.gz",
                      show_col_types = FALSE) %>%
  filter(subject_id == 10001217)
labevents_subset = open_dataset("labevents_pq") %>%
  filter(subject_id == 10001217) %>%
  collect()
diagnoses = read_csv("~/mimic/hosp/diagnoses_icd.csv.gz", 
                     show_col_types = FALSE)
diagnoses = diagnoses %>% 
  filter(subject_id == 10001217) %>%
  head(3)
diagnoses_codes = read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz",
                           show_col_types = FALSE)
diagnoses_codes = diagnoses_codes %>% 
  filter(icd_code %in% diagnoses$icd_code)
ggplot() +
  geom_segment(data = adt, aes(x = admittime, xend = dischtime, y = "lab"),
               linewidth = 0.1) +
  geom_point(data = labevents_subset, aes(x = charttime, y = "lab"), 
             shape = 3, size = 2) + 
  geom_segment(data = adt, aes(x = admittime, xend = dischtime, y = "adt")) + 
  geom_segment(data = color_info, aes(x = intime, xend = outtime, y = "adt", 
                                      colour = careunit, 
                                      linewidth = careunit)) + 
  geom_point(data = shape_info, aes(x = as.POSIXct(chartdate), y = "procedure",
                                    shape = icd_code),
             size = 4) + 
  scale_shape_manual(values = c("0139" = 17,
                                "0331" = 15,
                                "3897" = 16),
                     labels = c("0139" = "Other incision of brain",
                                "0331" = "Spinal Tap",
                                # explicit "\n" avoids embedding the source
                                # indentation in the legend label
                                "3897" = paste0("Central venous catheter\n",
                                                "placement with guidance")),
                     name = "Procedure") + 
  labs(title = paste0("Patient ", demographic$subject_id, ", ",  
                     demographic$gender, ", ", demographic$anchor_age, 
                     # adt has one row per admission; use the first race value
                     " years old ", tolower(adt$race[1])),
       subtitle = paste(diagnoses_codes$long_title[1],
                        diagnoses_codes$long_title[2],
                        diagnoses_codes$long_title[3],
                        sep = "\n"),
       x = "Calendar Time",
       y = "")
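
The figure description says the segment size should show whether a care unit is an ICU/CCU. A minimal sketch of deriving such a flag is given here, assuming ICU and CCU units can be recognized by the strings "ICU" or "CCU" in `careunit`. The table `transfers_toy` and the column `is_icu` are hypothetical names, and the rows are made up for illustration.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the transfers table (care unit names are illustrative).
transfers_toy <- tibble(
  careunit = c("Emergency Department",
               "Surgical Intensive Care Unit (SICU)",
               "Neurology",
               "Coronary Care Unit (CCU)")
)

# Flag ICU/CCU units so that the flag, not the unit name, can drive line width.
transfers_toy <- transfers_toy %>%
  mutate(is_icu = str_detect(careunit, "ICU|CCU"))

transfers_toy
```

In the plot, such a flag could be mapped with `linewidth = is_icu` in place of `linewidth = careunit`, which would give two widths instead of one per care unit.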

Do a similar visualization for the patient with subject_id 10063848 using ggplot.

demographic = read_csv("~/mimic/hosp/patients.csv.gz", 
                       show_col_types = FALSE) %>%
  filter(subject_id == 10063848)
adt = read_csv("~/mimic/hosp/admissions.csv.gz", 
               show_col_types = FALSE) %>%
  filter(subject_id == 10063848)
color_info = read_csv("~/mimic/hosp/transfers.csv.gz", 
                      show_col_types = FALSE) %>%
  filter(subject_id == 10063848)
shape_info = read_csv("~/mimic/hosp/procedures_icd.csv.gz",
                      show_col_types = FALSE) %>%
  filter(subject_id == 10063848)
procedures = read_csv("~/mimic/hosp/d_icd_procedures.csv.gz",
                      show_col_types = FALSE) 
procedures = procedures %>% 
  filter(icd_code %in% shape_info$icd_code)

procedures
# A tibble: 5 × 3
  icd_code icd_version long_title                                               
  <chr>          <dbl> <chr>                                                    
1 02HV33Z           10 Insertion of Infusion Device into Superior Vena Cava, Pe…
2 0DB80ZZ           10 Excision of Small Intestine, Open Approach               
3 0DN80ZZ           10 Release Small Intestine, Open Approach                   
4 0W9G30Z           10 Drainage of Peritoneal Cavity with Drainage Device, Perc…
5 4A023N6           10 Measurement of Cardiac Sampling and Pressure, Right Hear…
labevents_subset = open_dataset("labevents_pq") %>%
  filter(subject_id == 10063848) %>%
  collect()
diagnoses = read_csv("~/mimic/hosp/diagnoses_icd.csv.gz",
                     show_col_types = FALSE) %>%
  filter(subject_id == 10063848) %>%
  head(3)
diagnoses_codes = read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz",
                           show_col_types = FALSE)
diagnoses_codes = diagnoses_codes %>% 
  filter(icd_code %in% diagnoses$icd_code)
ggplot() +
  geom_segment(data = adt, aes(x = admittime, xend = dischtime, y = "lab"),
               linewidth = 0.1) +
  geom_point(data = labevents_subset, aes(x = charttime, y = "lab"), 
             shape = 3, size = 2) + 
  geom_segment(data = adt, aes(x = admittime, xend = dischtime, y = "adt")) + 
  geom_segment(data = color_info, aes(x = intime, xend = outtime, y = "adt", 
                                      colour = careunit, 
                                      linewidth = careunit)) + 
  geom_point(data = shape_info, aes(x = as.POSIXct(chartdate), y = "procedure",
                                    shape = icd_code), size = 4) + 
  scale_shape_manual(values = c("02HV33Z" = 17,
                                "0DB80ZZ" = 15,
                                "0DN80ZZ" = 16,
                                "0W9G30Z" = 14,
                                "4A023N6" = 13),
                     # explicit "\n" avoids embedding the source indentation
                     # in the legend labels
                     labels = c("02HV33Z" = 
                                  paste0("Insertion of Infusion Device\n",
                                         "into Superior Vena Cava,\n",
                                         "Percutaneous Approach"),
                                "0DB80ZZ" = 
                                  "Excision of Small Intestine, Open Approach",
                                "0DN80ZZ" = 
                                  "Release Small Intestine, Open Approach",
                                "0W9G30Z" = 
                                  paste0("Drainage of Peritoneal Cavity\n",
                                         "with Drainage Device,\n",
                                         "Percutaneous Approach"),
                                "4A023N6" = 
                                  paste0("Measurement of Cardiac Sampling\n",
                                         "and Pressure, Right Heart,\n",
                                         "Percutaneous Approach"))) + 
  labs(title = paste0("Patient ", demographic$subject_id, ", ",  
                     demographic$gender, ", ", demographic$anchor_age, 
                     # adt has one row per admission; use the first race value
                     " years old ", tolower(adt$race[1])),
       subtitle = paste(diagnoses_codes$long_title[1],
                        diagnoses_codes$long_title[2],
                        diagnoses_codes$long_title[3],
                        sep = "\n"),
       shape = "Procedure",
       x = "Calendar Time",
       y = "") + 
  theme(legend.position = "right", 
        legend.box.margin = margin(t = 15))
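
The subtitle relies on the top 3 diagnoses. Filtering the dictionary with `%in%` returns titles in dictionary order and ignores `icd_version`, so a hedged alternative is sketched here: sort by `seq_num`, keep three rows, and join on both `icd_code` and `icd_version`. The tables `diagnoses_toy` and `dictionary_toy` are made-up stand-ins. The sketch also assumes a single admission, since `seq_num` restarts for each `hadm_id`.

```r
library(dplyr)

# Toy stand-ins for diagnoses_icd and d_icd_diagnoses (values are invented).
diagnoses_toy <- tibble(
  subject_id  = c(1, 1, 1, 1),
  seq_num     = c(2, 1, 3, 4),
  icd_code    = c("B", "A", "C", "D"),
  icd_version = c(10, 10, 10, 10)
)
dictionary_toy <- tibble(
  icd_code    = c("A", "B", "C", "D"),
  icd_version = c(10, 10, 10, 10),
  long_title  = c("Title A", "Title B", "Title C", "Title D")
)

# Order by diagnosis priority, keep the first three, then look up the titles.
top3 <- diagnoses_toy %>%
  arrange(seq_num) %>%
  slice_head(n = 3) %>%
  left_join(dictionary_toy, by = c("icd_code", "icd_version"))

# One title per line, ready to pass to labs(subtitle = ...).
subtitle_text <- paste(top3$long_title, collapse = "\n")
```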

rm(list = ls())

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

chartevents_subset = open_dataset("chartevents_pq") %>%
  filter(subject_id == 10001217) %>%
  collect()
items = read_csv("~/mimic/icu/d_items.csv.gz", 
                 show_col_types = FALSE)

items = items %>% filter(abbreviation %in% c("HR", "NBPd","NBPs","RR",
                                             "Temperature F"))
chartevents_subset = chartevents_subset %>% 
  filter(itemid %in% items$itemid)
labels = c("220045" = "HR",
           "220179" = "NBPd",
           "220180" = "NBPs",
           "220210" = "RR",
           "223761" = "Temperature F")

ggplot(data = chartevents_subset) + 
  geom_line(mapping = aes(x = charttime, y = valuenum, 
                          colour = as.factor(itemid))) +
  geom_point(mapping = aes(x = charttime, y = valuenum, 
                          colour = as.factor(itemid))) + 
  facet_grid(itemid~stay_id, scales = "free", 
             labeller = labeller(itemid = as_labeller(labels))) + 
  scale_x_datetime(date_breaks = "6 hours",
                   date_labels = "%Y-%m-%d\n%H:%M") + 
  theme(axis.text.x = element_text(size = 6), legend.position = "none") +
  # subject_id repeats on every row; use a single value for the title
  labs(title = paste("Patient", chartevents_subset$subject_id[1], 
                     "ICU stays - Vitals"),
       x = "", 
       y = "")
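
The facet labels above are typed by hand. A minimal sketch of deriving them from the item dictionary instead is given here, using made-up rows in place of `d_items` and `chartevents`. The names `items_toy`, `chart_toy`, and `chart_labeled` are hypothetical.

```r
library(dplyr)

# Toy stand-ins for d_items and the filtered chartevents (values are invented).
items_toy <- tibble(
  itemid       = c(220045, 220210),
  abbreviation = c("HR", "RR")
)
chart_toy <- tibble(
  itemid   = c(220045, 220210, 220045),
  valuenum = c(80, 18, 84)
)

# Attach the abbreviation to each measurement; rows without a match are dropped.
chart_labeled <- chart_toy %>%
  inner_join(items_toy, by = "itemid")

chart_labeled
```

With the abbreviation available as a column, the plot could facet on `abbreviation ~ stay_id` directly, and the separate `labels` vector would not be needed.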

Do a similar visualization for the patient 10063848.

chartevents_subset = open_dataset("chartevents_pq") %>%
  filter(subject_id == 10063848) %>%
  collect()
chartevents_subset = chartevents_subset %>% 
  filter(itemid %in% items$itemid)
labels = c("220045" = "HR",
           "220179" = "NBPd",
           "220180" = "NBPs",
           "220210" = "RR",
           "223761" = "Temperature F")

ggplot(data = chartevents_subset) + 
  geom_line(mapping = aes(x = charttime, y = valuenum, 
                          colour = as.factor(itemid))) +
  geom_point(mapping = aes(x = charttime, y = valuenum, 
                          colour = as.factor(itemid))) + 
  facet_grid(itemid~stay_id, scales = "free_y", 
             labeller = labeller(itemid = as_labeller(labels))) + 
  scale_x_datetime(date_breaks = "10 hours",
                   date_labels = "%Y-%m-%d\n%H:%M") + 
  theme(axis.text.x = element_text(size = 5), legend.position = "none") +
  # subject_id repeats on every row; use a single value for the title
  labs(title = paste("Patient", chartevents_subset$subject_id[1], 
                     "ICU stays - Vitals"),
       x = "", 
       y = "")

rm(list = ls())

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000690,25860671,37081114,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2150-11-02 19:37:00,2150-11-06 17:03:17,3.8932523148148146
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.948113425925926
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001843,26133978,39698942,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2134-12-05 18:50:03,2134-12-06 14:38:26,0.8252662037037037
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.17181712962963
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.314351851851852

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

icustays_tble = read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 94458 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

length(unique(icustays_tble$subject_id))
[1] 65366
icustays_tble %>% count(subject_id, sort = TRUE)
# A tibble: 65,366 × 2
   subject_id     n
        <dbl> <int>
 1   12468016    41
 2   18358138    37
 3   17585185    34
 4   17295976    31
 5   13269859    30
 6   18676703    27
 7   12517625    26
 8   11281568    25
 9   15229355    25
10   15455517    25
# ℹ 65,356 more rows
icu_counts = icustays_tble %>% 
  count(subject_id, sort = TRUE) %>% 
  head(15)
ggplot(data = icu_counts, aes(x = as.factor(subject_id), 
                              y = n,
                              fill = as.factor(subject_id))) +
  geom_bar(stat = "identity", color = "lightblue") +
  labs(title = "Top 15 Subjects with Most ICU Stays",
       x = "Subject ID",
       y = "ICU Stays Count",
       fill = "Subject ID Legend") + 
  theme(axis.text.x = element_text(angle = 90))

icu_counts = icustays_tble %>% 
  count(subject_id, sort = TRUE)
ggplot(icu_counts, aes(x = n)) + 
  geom_histogram() +
  labs(title = "Distribution of Counts of ICU Stays",
       x = "Counts",
       y = "Number of Subject ID in Log Scale") + 
  scale_y_continuous(trans = "log10")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning in scale_y_continuous(trans = "log10"): log-10 transformation
introduced infinite values.
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_bar()`).

Solution: There are 65366 unique values for subject_id. Yes, a subject_id can have multiple ICU stays. Each row represents one ICU stay, so any count of subject_id that is greater than 1 indicates more than one ICU stay for a particular subject.
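
A compact numeric summary can back up this answer. The sketch here counts stays per subject and then counts how many subjects have each number of stays. It runs on a toy table, `icustays_toy`, rather than on `icustays_tble`, and the names `n_stays` and `n_subjects` are illustrative.

```r
library(dplyr)

# Toy stand-in for icustays_tble: one row per ICU stay.
icustays_toy <- tibble(subject_id = c(1, 1, 2, 3, 3, 3, 4))

# Number of ICU stays for each subject.
stays_per_subject <- icustays_toy %>%
  count(subject_id, name = "n_stays")

# Number of subjects having each stay count.
stay_distribution <- stays_per_subject %>%
  count(n_stays, name = "n_subjects")

# Share of subjects with more than one ICU stay.
share_multiple <- mean(stays_per_subject$n_stays > 1)
```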

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P49AFC,TRANSFER FROM HOSPITAL,HOME,Medicaid,English,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P784FA,EMERGENCY ROOM,HOME,Medicaid,English,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P19UTS,EMERGENCY ROOM,HOSPICE,Medicaid,English,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P06OTX,EMERGENCY ROOM,HOME,Medicaid,English,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P39NWO,EMERGENCY ROOM,,,English,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P42H7G,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,English,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P35NE4,PHYSICIAN REFERRAL,,Medicare,English,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P40JML,EMERGENCY ROOM,,,English,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P47EY8,EMERGENCY ROOM,,Medicaid,English,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

admissions_tble = read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 546028 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(admissions_tble)
# A tibble: 6 × 16
  subject_id hadm_id admittime           dischtime           deathtime
       <dbl>   <dbl> <dttm>              <dttm>              <dttm>   
1   10000032  2.26e7 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
2   10000032  2.28e7 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
3   10000032  2.57e7 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
4   10000032  2.91e7 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
5   10000068  2.50e7 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
6   10000084  2.31e7 2160-11-21 01:56:00 2160-11-25 14:52:00 NA       
# ℹ 11 more variables: admission_type <chr>, admit_provider_id <chr>,
#   admission_location <chr>, discharge_location <chr>, insurance <chr>,
#   language <chr>, marital_status <chr>, race <chr>, edregtime <dttm>,
#   edouttime <dttm>, hospital_expire_flag <dbl>

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Solution: Graph for “Number of Admissions per Patient”

admission_per_patient = admissions_tble %>% 
  group_by(subject_id) %>%
  summarize(admission_counts = n_distinct(hadm_id)) %>%
  arrange(desc(admission_counts)) %>%
  head(10)
ggplot(data = admission_per_patient, aes(as.factor(subject_id), 
                              y = admission_counts,
                              fill = as.factor(subject_id))) +
  geom_bar(stat = "identity", color = "green") +
  labs(title = "Top 10 Subjects with Most Admissions",
       x = "Subject ID",
       y = "Admissions Count",
       fill = "Subject ID Legend") + 
  theme(axis.text.x = element_text(angle = 90))

Solution: Graph for “admission hour”

ggplot(admissions_tble, aes(x = hour(admittime))) + 
  geom_histogram(bins = 24, fill = "orangered") + 
  labs(title = "Admission Time Hour Distribution",
       x = "Hour of Admission",
       y = "Count") + 
  theme(legend.position = "none")

This graph shows that the peak admission hour is midnight (the start of a new day). This is unexpected: I would expect most admissions to occur in the morning or the middle of the day, when most people are awake and would seek treatment. My best guess is that midnight reflects when a large amount of patient data is entered into the database: a portion of the previous day’s records may be added at the start of the new day.

Solution: Graph for “admission minute”

ggplot(admissions_tble, aes(x = minute(admittime))) + 
  geom_histogram(bins = 15, fill = "royalblue") + 
  labs(title = "Admission Time Minute Distribution",
       x = "Minute of Admission",
       y = "Count") + 
  theme(legend.position = "none")

This graph shows that the most common admission minute is 0. I consider this unusual because I would expect the minutes to be roughly uniformly distributed. The cause may be tied to what I mentioned earlier about data being recorded at a particular time (midnight): admissions documented at exactly midnight have a minute value of 0, which inflates that count.
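
A histogram with 15 bins groups four minutes into each bar, so an exact count per minute can help confirm where the spike sits. This is a minimal sketch on invented timestamps; `admit_toy` and `admit_minute` are hypothetical names.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for admissions_tble with a few admission times.
admit_toy <- tibble(
  admittime = ymd_hms(c("2180-05-06 00:00:00", "2180-06-26 18:15:00",
                        "2180-08-05 23:44:00", "2180-07-23 00:00:00"))
)

# Count admissions by exact minute of the hour, most frequent first.
minute_counts <- admit_toy %>%
  count(admit_minute = minute(admittime), sort = TRUE)

minute_counts
```

The same call with `hour(admittime)` would tabulate admission hours.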

Solution: Graph for “length of hospital stay (from admission to discharge)”

time_difference_sec = admissions_tble$dischtime - admissions_tble$admittime

admissions_tble$time_difference_hour = time_difference_sec/3600
ggplot(admissions_tble, aes(x = as.numeric(time_difference_hour))) + 
  geom_histogram(bins = 15, fill = "purple") + 
  labs(title = "Distribution for Length of Hospital Stay",
       x = "Time Difference in Hours",
       y = "Count")

This graph shows that, after subtracting the admission time from the discharge time, the majority of values fall near 0 hours. This is unusual, since a hospital stay would be expected to last quite a while. The dates in MIMIC-IV are known to be shifted to protect patients, which could contribute to the unusual times, although the documentation quoted above says dates stay internally consistent for the same patient, so the shift alone should not change the interval between admission and discharge.
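
Subtracting two date-time vectors lets R choose the units automatically, so converting by a fixed factor assumes a particular unit. A hedged alternative is to request the units explicitly with `difftime()`, sketched here on two made-up rows. The names `stay_toy` and `los_days` are illustrative.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for admissions_tble with admission and discharge times.
stay_toy <- tibble(
  admittime = ymd_hms(c("2180-05-06 22:23:00", "2160-11-21 01:56:00")),
  dischtime = ymd_hms(c("2180-05-07 17:15:00", "2160-11-25 14:52:00"))
)

# Length of stay in days, with the units stated explicitly.
stay_toy <- stay_toy %>%
  mutate(los_days = as.numeric(difftime(dischtime, admittime,
                                        units = "days")))

# Quantiles describe the bulk of the distribution without relying on bin choice.
quantile(stay_toy$los_days, probs = c(0.25, 0.5, 0.75))
```

Plotting `los_days` with a narrower bin width, or on a log scale, may show the shape of short stays more clearly than 15 bins over the full range.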

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000058,F,33,2168,2020 - 2022,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

patients_tble = read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 364627 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(patients_tble)
# A tibble: 6 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10000032 F              52        2180 2014 - 2016       2180-09-09
2   10000048 F              23        2126 2008 - 2010       NA        
3   10000058 F              33        2168 2020 - 2022       NA        
4   10000068 F              19        2160 2008 - 2010       NA        
5   10000084 M              72        2160 2017 - 2019       2161-02-13
6   10000102 F              27        2136 2008 - 2010       NA        

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

patients_tble %>% count(gender)
# A tibble: 2 × 2
  gender      n
  <chr>   <int>
1 F      191984
2 M      172643
ggplot(data = patients_tble, aes(x = as.factor(gender),
                                 fill = as.factor(gender))) + 
  geom_bar(width = 0.4) + 
  scale_fill_manual(values = c("pink", "lightblue")) + 
  labs(title = "Patient Gender Count", 
       x = "Gender",
       y = "Count",
       fill = "Gender")

This graph indicates that females make up the majority of the patients, although the gender distribution is not overwhelmingly one-sided.

ggplot(patients_tble, aes(x = anchor_age)) + 
  geom_histogram(bins = 20, fill = "gold") + 
  labs(title = "Distribution of Anchor Age",
       x = "Anchor Age", 
       y = "Count")

The anchor ages with the highest counts are under 25, and the counts decrease steadily as the anchor age increases. I would have expected older patients to make up the majority, but the graph suggests this is not the case. One possible reason is that the anchor ages are not exact and have been shifted to protect the patients’ privacy.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz containing only these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

important_measurements = c(50912, 50971, 50983, 50902, 50882, 51221,
                           51301, 50931)

labevents_subset = open_dataset("labevents_pq") %>%
  to_duckdb() %>%
  filter(itemid %in% important_measurements) %>%
  filter(subject_id %in% icustays_tble$subject_id) %>%
  arrange(subject_id) %>%
  collect()
labevents_subset = left_join(labevents_subset, icustays_tble,
                   by = "subject_id")
labevents_subset = labevents_subset %>%
  filter(storetime < intime)
labevents_subset = labevents_subset %>% 
  group_by(subject_id, stay_id, itemid) %>%
  slice_max(storetime, with_ties = FALSE) %>%
  select(subject_id, stay_id, itemid, valuenum)
labevents_tble = labevents_subset %>% 
  pivot_wider(names_from = itemid, values_from = valuenum)
colnames(labevents_tble) = c("subject_id",
                               "stay_id",
                               "bicarbonate", 
                               "chloride",
                               "creatinine",
                               "glucose",
                               "potassium",
                               "sodium",
                               "hematocrit",
                               "white_blood_cell_count")

head(labevents_tble, 10)
# A tibble: 10 × 10
# Groups:   subject_id, stay_id [10]
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000690 37081114          26      100        1        85       4.8    137
 3   10000980 39765666          21      109        2.3      89       3.9    144
 4   10001217 34592300          30      104        0.5      87       4.1    142
 5   10001217 37067082          22      108        0.6     112       4.2    142
 6   10001725 31205490          NA       98       NA        NA       4.1    139
 7   10001843 39698942          28       97        1.3     131       3.9    138
 8   10001884 37510196          30       88        1.1     141       4.5    130
 9   10002013 39060235          24      102        0.9     288       3.5    137
10   10002114 34672098          18       NA        3.1      95       6.5    125
# ℹ 2 more variables: hematocrit <dbl>, white_blood_cell_count <dbl>
rm(labevents_subset)
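
The positional colnames() assignment above is correct only when the pivoted columns come out in the expected order. An alternative that does not depend on column order is to map each itemid to a name before pivoting. The following is a minimal sketch on toy data; lab_names, toy_long, and toy_wide are hypothetical names introduced for illustration, and only the item IDs come from the question text.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical lookup table: itemid (as character) -> column name
lab_names <- c(
  "50912" = "creatinine",
  "50971" = "potassium",
  "50983" = "sodium",
  "50902" = "chloride",
  "50882" = "bicarbonate",
  "51221" = "hematocrit",
  "51301" = "white_blood_cell_count",
  "50931" = "glucose"
)

# Toy long-format data standing in for the one-row-per-lab result
toy_long <- tibble(
  subject_id = c(1, 1, 2),
  stay_id    = c(11, 11, 22),
  itemid     = c(50971, 50882, 50912),
  valuenum   = c(4.1, 25, 0.7)
)

# Translate itemid to a readable name first, then pivot on that name
toy_wide <- toy_long %>%
  mutate(lab = unname(lab_names[as.character(itemid)])) %>%
  select(-itemid) %>%
  pivot_wider(names_from = lab, values_from = valuenum)

toy_wide
```

With this approach each value lands under the name tied to its own itemid, so a change in row order cannot mislabel a column.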

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement (by storetime) within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

important_vitals = c(220045, 220179, 220180, 223761, 220210)

chartevents_subset = open_dataset("chartevents_pq") %>%
  to_duckdb() %>%
  filter(itemid %in% important_vitals) %>%
  filter(subject_id %in% icustays_tble$subject_id) %>%
  arrange(subject_id) %>%
  collect()
chartevents_subset = chartevents_subset %>% 
  group_by(subject_id, stay_id, itemid) %>%
  slice_min(storetime) %>%
  select(subject_id, stay_id, itemid, valuenum)
chartevents_subset = chartevents_subset %>% 
  group_by(subject_id, stay_id, itemid) %>% 
  summarize(valuenum = round(mean(valuenum), 
           digits = 1), .groups = "drop")
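# names_sort = TRUE orders the new columns by ascending itemid, which is the
# order the positional names below assume
chartevents_tble = chartevents_subset %>% 
  pivot_wider(names_from = itemid, values_from = valuenum,
              names_sort = TRUE)
colnames(chartevents_tble) = c("subject_id",
                               "stay_id",
                               "heart_rate",                             # 220045
                               "systolic_non-invasive_blood_pressure",   # 220179
                               "diastolic_non-invasive_blood_pressure",  # 220180
                               "respiratory rate",                       # 220210
                               "body_temperature_fahrenheit")            # 223761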

head(chartevents_tble)
# A tibble: 6 × 7
  subject_id  stay_id heart_rate systolic_non-invasive_…¹ diastolic_non-invasi…²
       <dbl>    <dbl>      <dbl>                    <dbl>                  <dbl>
1   10000032 39553978       91                         84                   48  
2   10000690 37081114       78                        106                   56.5
3   10000980 39765666       76                        154                  102  
4   10001217 34592300       79.3                      156                   93.3
5   10001217 37067082       86                        151                   90  
6   10001725 31205490       86                         73                   56  
# ℹ abbreviated names: ¹​`systolic_non-invasive_blood_pressure`,
#   ²​`diastolic_non-invasive_blood_pressure`
# ℹ 2 more variables: `respiratory rate` <dbl>,
#   body_temperature_fahrenheit <dbl>
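
The question asks for the first measurement within the ICU stay. One way to make that window explicit is to join each charted row to its stay's intime and outtime and filter on storetime before taking the earliest row. The sketch below uses toy data; toy_stays, toy_chart, and first_vitals are hypothetical names, and whether any real chartevents rows fall outside the window is not checked here. Unlike the code above, it keeps a single row per group (with_ties = FALSE) instead of averaging ties.

```r
library(dplyr)
library(tibble)

# Toy ICU stays; column names mirror icustays_tble
toy_stays <- tibble(
  stay_id = c(1, 2),
  intime  = as.POSIXct(c("2150-01-01 10:00:00", "2150-02-01 08:00:00"), tz = "UTC"),
  outtime = as.POSIXct(c("2150-01-03 10:00:00", "2150-02-02 08:00:00"), tz = "UTC")
)

# Toy charted heart-rate rows (itemid 220045)
toy_chart <- tibble(
  stay_id   = c(1, 1, 1, 2),
  itemid    = c(220045, 220045, 220045, 220045),
  storetime = as.POSIXct(c("2150-01-01 09:30:00",   # before intime: dropped
                           "2150-01-01 10:15:00",   # first row inside stay 1
                           "2150-01-01 11:00:00",
                           "2150-02-01 09:00:00"),  # first row inside stay 2
                         tz = "UTC"),
  valuenum  = c(120, 91, 88, 78)
)

first_vitals <- toy_chart %>%
  inner_join(toy_stays, by = "stay_id") %>%
  filter(storetime >= intime, storetime <= outtime) %>%  # keep rows inside the stay
  group_by(stay_id, itemid) %>%
  slice_min(storetime, with_ties = FALSE) %>%            # earliest row per stay and item
  ungroup() %>%
  select(stay_id, itemid, valuenum)

first_vitals
```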

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least the following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

mimic_icu_cohort = left_join(icustays_tble, admissions_tble, 
                         by = c("subject_id", "hadm_id"))

head(mimic_icu_cohort)
# A tibble: 6 × 23
  subject_id  hadm_id  stay_id first_careunit  last_careunit intime             
       <dbl>    <dbl>    <dbl> <chr>           <chr>         <dttm>             
1   10000032 29079034 39553978 Medical Intens… Medical Inte… 2180-07-23 14:00:00
2   10000690 25860671 37081114 Medical Intens… Medical Inte… 2150-11-02 19:37:00
3   10000980 26913865 39765666 Medical Intens… Medical Inte… 2189-06-27 08:42:00
4   10001217 24597018 37067082 Surgical Inten… Surgical Int… 2157-11-20 19:18:02
5   10001217 27703517 34592300 Surgical Inten… Surgical Int… 2157-12-19 15:42:24
6   10001725 25563031 31205490 Medical/Surgic… Medical/Surg… 2110-04-11 15:52:22
# ℹ 17 more variables: outtime <dttm>, los <dbl>, admittime <dttm>,
#   dischtime <dttm>, deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>, time_difference_hour <drtn>
mimic_icu_cohort = left_join(mimic_icu_cohort, patients_tble, 
                         by = "subject_id")

head(mimic_icu_cohort)
# A tibble: 6 × 28
  subject_id  hadm_id  stay_id first_careunit  last_careunit intime             
       <dbl>    <dbl>    <dbl> <chr>           <chr>         <dttm>             
1   10000032 29079034 39553978 Medical Intens… Medical Inte… 2180-07-23 14:00:00
2   10000690 25860671 37081114 Medical Intens… Medical Inte… 2150-11-02 19:37:00
3   10000980 26913865 39765666 Medical Intens… Medical Inte… 2189-06-27 08:42:00
4   10001217 24597018 37067082 Surgical Inten… Surgical Int… 2157-11-20 19:18:02
5   10001217 27703517 34592300 Surgical Inten… Surgical Int… 2157-12-19 15:42:24
6   10001725 25563031 31205490 Medical/Surgic… Medical/Surg… 2110-04-11 15:52:22
# ℹ 22 more variables: outtime <dttm>, los <dbl>, admittime <dttm>,
#   dischtime <dttm>, deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>, time_difference_hour <drtn>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>
mimic_icu_cohort = left_join(mimic_icu_cohort, labevents_tble, 
                         by = c("subject_id", "stay_id"))

head(mimic_icu_cohort)
# A tibble: 6 × 36
  subject_id  hadm_id  stay_id first_careunit  last_careunit intime             
       <dbl>    <dbl>    <dbl> <chr>           <chr>         <dttm>             
1   10000032 29079034 39553978 Medical Intens… Medical Inte… 2180-07-23 14:00:00
2   10000690 25860671 37081114 Medical Intens… Medical Inte… 2150-11-02 19:37:00
3   10000980 26913865 39765666 Medical Intens… Medical Inte… 2189-06-27 08:42:00
4   10001217 24597018 37067082 Surgical Inten… Surgical Int… 2157-11-20 19:18:02
5   10001217 27703517 34592300 Surgical Inten… Surgical Int… 2157-12-19 15:42:24
6   10001725 25563031 31205490 Medical/Surgic… Medical/Surg… 2110-04-11 15:52:22
# ℹ 30 more variables: outtime <dttm>, los <dbl>, admittime <dttm>,
#   dischtime <dttm>, deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>, time_difference_hour <drtn>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>, …
mimic_icu_cohort = left_join(mimic_icu_cohort, chartevents_tble, 
                         by = c("subject_id", "stay_id"))

head(mimic_icu_cohort)
# A tibble: 6 × 41
  subject_id  hadm_id  stay_id first_careunit  last_careunit intime             
       <dbl>    <dbl>    <dbl> <chr>           <chr>         <dttm>             
1   10000032 29079034 39553978 Medical Intens… Medical Inte… 2180-07-23 14:00:00
2   10000690 25860671 37081114 Medical Intens… Medical Inte… 2150-11-02 19:37:00
3   10000980 26913865 39765666 Medical Intens… Medical Inte… 2189-06-27 08:42:00
4   10001217 24597018 37067082 Surgical Inten… Surgical Int… 2157-11-20 19:18:02
5   10001217 27703517 34592300 Surgical Inten… Surgical Int… 2157-12-19 15:42:24
6   10001725 25563031 31205490 Medical/Surgic… Medical/Surg… 2110-04-11 15:52:22
# ℹ 35 more variables: outtime <dttm>, los <dbl>, admittime <dttm>,
#   dischtime <dttm>, deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>, time_difference_hour <drtn>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>, …
mimic_icu_cohort = mimic_icu_cohort %>%
  mutate(age_intime = anchor_age + year(intime) - anchor_year)
mimic_icu_cohort = mimic_icu_cohort %>%
  filter(age_intime >= 18)
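
As a quick check of the age formula above, here is a worked example for a single made-up patient. The values of anchor_age, anchor_year, and intime are hypothetical and are not taken from the MIMIC data.

```r
library(lubridate)

# Hypothetical patient: aged 52 in anchor year 2180, admitted to the ICU in 2183
anchor_age  <- 52
anchor_year <- 2180
intime      <- as.POSIXct("2183-07-23 14:00:00", tz = "UTC")

# Age at ICU admission = age at the anchor year + years elapsed since then
age_intime <- anchor_age + year(intime) - anchor_year
age_intime
```

Here three calendar years have elapsed since the anchor year, so the computed age at intime is 52 + 3 = 55.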

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the first vital measurements within the ICU stay

  • Length of ICU stay los vs first ICU unit

Solution: Plots for Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

ggplot(data = mimic_icu_cohort, 
       aes(x = race, y = los)) + 
  geom_violin() +
  theme(axis.text.x = element_text(angle = 90, size = 6)) +
  labs(title = "Violin Plot for Race vs Length of Stay",
       x = "Race",
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = insurance, y = los)) + 
  geom_violin() + 
  labs(title = "Violin Plot for Insurance vs Length of Stay",
       x = "Insurance",
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = marital_status, y = los)) + 
  geom_violin() + 
  labs(title = "Violin Plot for Marital Status vs Length of Stay",
       x = "Marital Status",
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = gender, y = los)) + 
  geom_violin() +
  labs(title = "Violin Plot for Gender vs Length of Stay",
       x = "Gender",
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort,
       aes(x = age_intime, y = los)) +
  geom_point() + 
  labs(title = "Scatterplot for In-time Age vs Length of Stay",
       x = "In-time Age",
       y = "Length of Stay")
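
The question allows numerics as well as graphs, and a per-group summary table can complement the violin plots. The following is a minimal sketch on toy data; toy_cohort and los_by_insurance are hypothetical names, and the values are invented for illustration. The same pattern could be applied to mimic_icu_cohort with any of the categorical demographic variables.

```r
library(dplyr)
library(tibble)

# Toy cohort with one categorical variable and length of stay in days
toy_cohort <- tibble(
  insurance = c("Medicare", "Medicare", "Medicaid", "Other", "Other", "Other"),
  los       = c(1.2, 3.4, 0.8, 2.0, 10.5, 1.5)
)

# Count, median, and mean length of stay within each insurance group
los_by_insurance <- toy_cohort %>%
  group_by(insurance) %>%
  summarise(n          = n(),
            median_los = median(los),
            mean_los   = mean(los),
            .groups    = "drop")

los_by_insurance
```

Because length of stay is typically right-skewed, reporting the median next to the mean helps show how much long stays pull the average up.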

Solution: Plots for Length of ICU stay los vs the last available lab measurements before ICU stay

ggplot(data = mimic_icu_cohort, 
       aes(x = bicarbonate,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Bicarbonate vs Length of Stay",
       x = "Bicarbonate", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = chloride,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Chloride vs Length of Stay",
       x = "Chloride", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = creatinine,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Creatinine vs Length of Stay",
       x = "Creatinine", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = glucose,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Glucose vs Length of Stay",
       x = "Glucose", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = potassium,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Potassium vs Length of Stay",
       x = "Potassium", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = sodium,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Sodium vs Length of Stay",
       x = "Sodium", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = hematocrit,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Hematocrit vs Length of Stay",
       x = "Hematocrit", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = white_blood_cell_count,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for White Blood Cell Count vs Length of Stay",
       x = "White Blood Cell Count", 
       y = "Length of Stay")

Solution: Plots for Length of ICU stay los vs the first vital measurements within the ICU stay

ggplot(data = mimic_icu_cohort, 
       aes(x = heart_rate,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Heart Rate vs Length of Stay",
       x = "Heart Rate", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = `systolic_non-invasive_blood_pressure`,
           y = los)) + 
  geom_point() + 
  coord_cartesian(xlim = c(0, 260)) + 
  labs(title = "Scatterplot for Systolic Non-Invasive\nBlood Pressure vs Length of Stay",
       x = "Systolic Non-Invasive Blood Pressure", 
       y = "Length of Stay")

summary(mimic_icu_cohort$`systolic_non-invasive_blood_pressure`)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
     0.0    105.7    121.0    124.0    138.0 116114.0     1313 

I limited the x-axis of the graph above because the outlier value of 116114 made it hard to see the overall distribution of blood pressure. Because coord_cartesian() only zooms the view, the outlier is still in the data even though it does not appear on the graph. The value is not physiologically plausible: a person's blood pressure cannot be that high.

ggplot(data = mimic_icu_cohort, 
       aes(x = `diastolic_non-invasive_blood_pressure`,
           y = los)) + 
  geom_point() + 
  coord_cartesian(xlim = c(0, 260)) + 
  labs(title = "Scatterplot for Diastolic Non-Invasive\nBlood Pressure vs Length of Stay",
       x = "Diastolic Non-Invasive Blood Pressure", 
       y = "Length of Stay")

summary(mimic_icu_cohort$`diastolic_non-invasive_blood_pressure`)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
    0.00    57.50    68.00    72.15    79.30 70130.00     1318 

This graph also has outliers in the blood pressure values (the maximum is 70130) that distort the plot, so I limited the x-axis here as well to show the distribution more clearly. The outliers remain in the data even though they are not visible on the graph. Such extreme values are not physiologically plausible; a human's blood pressure cannot be that high.
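
If the implausible blood pressure readings noted above should be excluded from an analysis, not just hidden by the axis limits, one option is to set them to NA under an explicit plausibility window. The sketch below uses toy data; toy_bp and sbp_range are hypothetical names, and the 30 to 300 mmHg window is an assumption chosen for illustration, not a threshold from the assignment.

```r
library(dplyr)
library(tibble)

# Toy systolic readings, including one extreme value and one zero
toy_bp <- tibble(
  stay_id  = 1:5,
  systolic = c(84, 121, 116114, 0, 154)
)

# Assumed plausibility window in mmHg (illustrative only)
sbp_range <- c(30, 300)

# Keep readings inside the window; replace everything else with NA
toy_bp_clean <- toy_bp %>%
  mutate(systolic_clean = if_else(
    between(systolic, sbp_range[1], sbp_range[2]),
    systolic,
    NA_real_
  ))

toy_bp_clean
```

Writing the cleaned values to a new column keeps the raw readings available, so the number of flagged values can still be reported.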

ggplot(data = mimic_icu_cohort, 
       aes(x = `respiratory rate`,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Respiratory Rate vs Length of Stay",
       x = "Respiratory Rate", 
       y = "Length of Stay")

ggplot(data = mimic_icu_cohort, 
       aes(x = body_temperature_fahrenheit,
           y = los)) + 
  geom_point() + 
  labs(title = "Scatterplot for Body Temperature vs Length of Stay",
       x = "Body Temperature", 
       y = "Length of Stay")

Extreme values also distort this graph somewhat, but the distribution of body temperature is still visible, so I have kept all data points. Note that the extreme body temperature values are not physiologically plausible.

Solution: Graph for Length of ICU stay los vs first ICU unit

ggplot(data = mimic_icu_cohort,
       aes(x = first_careunit,
           y = los)) + 
  geom_boxplot() + 
  coord_flip() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title = "First ICU Unit vs Length of Stay",
       x = "First ICU Unit",
       y = "Length of Stay")